How to speed up Python using Cython and Pythran#
Python is great, because it’s easy to program and a lot of advanced functionality is either build in or is available through modules. However, sometimes it is really just too slow.
Here we are going to look at how to use two extensions to Python, which compile python code to C or C++ to greatly speed up your python code.
Pythran serge-sans-paille/pythran
Cython https://cython.org/
Go to https://python4photonics.org/ to get the notebook
The test function#
To demonstrate the speed-ups one can achieve we are going to look at a typical DSP function: making symbol decisions.
The function requirements#
For each symbol in a (noisy) input signal array find the nearest symbol from a given alphabet, using eucledian distance.
Input Parameters:
input signal array
input alphabet array
Output parameters:
output signal with the same shape as the input signal composed of symbols from the alphabet
Timing#
We are going to use the timeit magic function in jupyter for timing our efforst, in more complex code you might want to use a profiler instead
Generate test data#
Lets start of with some easy test data so we now that everything works
# import the base modules we will use
%pylab inline
import numpy as np
from qampy import signals, impairments # convenient way of quickly generating a signals and symbol arrays
%pylab is deprecated, use %matplotlib inline and import the required libraries.
Populating the interactive namespace from numpy and matplotlib
# generate a signal array with some added noise
qpsk = signals.SignalQAMGrayCoded(4, 10, nmodes=1) # generate a single polarization QPSK signal with 100 samples)
qpsk = impairments.change_snr(qpsk, 15) # change the SNR of the signal to 15dB
plt.figure(figsize=(10,10))
plt.plot(qpsk[0].real, qpsk[0].imag, '.') # plot the constellation
plt.plot(qpsk.coded_symbols.real, qpsk.coded_symbols.imag, 'x', ms=10) # plot the alphabet on top
[<matplotlib.lines.Line2D at 0x72c918bd9060>]
Naive Python implementation#
Let us start of by implementing a naive Python implementation
def make_decision_py(signal, alphabet):
out = np.zeros_like(signal)
for i in range(signal.size): # we only do 1D here
dmin = 1000
for j in range(alphabet.size):
d = abs(signal[i]-alphabet[j])
if d < dmin:
idx = j
dmin = d
out[i] = alphabet[idx]
return out
Test that it works
out = make_decision_py(qpsk[0], qpsk.coded_symbols)
print(out)
print(qpsk.symbols[0]) # these are the transmitted symbols
print(qpsk.symbols[0]==out)
[-0.70710678-0.70710678j 0.70710678-0.70710678j 0.70710678+0.70710678j
0.70710678-0.70710678j -0.70710678-0.70710678j 0.70710678+0.70710678j
0.70710678-0.70710678j -0.70710678+0.70710678j -0.70710678-0.70710678j
-0.70710678-0.70710678j]
[-0.70710678-0.70710678j 0.70710678-0.70710678j 0.70710678+0.70710678j
0.70710678-0.70710678j -0.70710678-0.70710678j 0.70710678+0.70710678j
0.70710678-0.70710678j -0.70710678+0.70710678j -0.70710678-0.70710678j
-0.70710678-0.70710678j]
[ True True True True True True True True True True]
Time the code#
Lets do some timing of the code
%timeit make_decision_py(qpsk[0], qpsk.coded_symbols)
12.1 µs ± 470 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
That is still reasonably fast, but it is a very short array and only QPSK let us have a look at some more realistic data
# generate a signal array with some added noise
qam64 = signals.SignalQAMGrayCoded(64, 10**4, nmodes=1) # generate a single polarization 128-QAM signal with 10 000 samples)
qam64 = impairments.change_snr(qam64, 30) # change the SNR of the signal to 30dB
plt.figure(figsize=(10,10))
plt.plot(qam64[0].real, qam64[0].imag, '.') # plot the constellation
plt.plot(qam64.coded_symbols.real, qam64.coded_symbols.imag, 'x', ms=10) # plot the alphabet on top
[<matplotlib.lines.Line2D at 0x72c9161c3f70>]
%timeit make_decision_py(qam64[0], qam64.coded_symbols)
105 ms ± 832 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
That is a lot slower and still quite a short array!
Numpy Implementation#
Generally you should always use the numpy vector functions if you can because they will give you a significant speedup. So let us do that
def make_decision_np(signal, alphabet):
out = np.zeros_like(signal)
d = np.abs(signal[np.newaxis,:]-alphabet[:, np.newaxis])
idx = np.argmin(d, axis=0)
out = alphabet[idx]
return out
out = make_decision_np(qpsk[0], qpsk.coded_symbols)
print(out)
print(qpsk.symbols[0]) # these are the transmitted symbols
print(qpsk.symbols[0]==out)
[-0.70710678-0.70710678j 0.70710678-0.70710678j 0.70710678+0.70710678j
0.70710678-0.70710678j -0.70710678-0.70710678j 0.70710678+0.70710678j
0.70710678-0.70710678j -0.70710678+0.70710678j -0.70710678-0.70710678j
-0.70710678-0.70710678j]
[-0.70710678-0.70710678j 0.70710678-0.70710678j 0.70710678+0.70710678j
0.70710678-0.70710678j -0.70710678-0.70710678j 0.70710678+0.70710678j
0.70710678-0.70710678j -0.70710678+0.70710678j -0.70710678-0.70710678j
-0.70710678-0.70710678j]
[ True True True True True True True True True True]
%timeit make_decision_np(qam64[0], qam64.coded_symbols)
1.27 ms ± 7.29 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
That is a factor of 20 faster!
Speeding up#
OK let’s try to speed up this code some more
I will use two packages here
Cython
Cython is an optimising static compiler for both the Python programming language and the extended Cython programming language. It essentially allows you to write modules in an extended Python dialect that can be compile directly to C and provide significant speed-ups by avoiding most of the Python overheads. Cython has been around for a long time, so it is realtively mature. It is also great for wrapping C-libraries for use with Python.
Pythran
Pythran is an ahead of time compiler for a subset of the Python language, with a focus on scientific computing. It takes a Python module annotated with a few interface description and turns it into a native Python module with the same interface, but (hopefully) faster. Pythran is much younger than Cython, but has been progressing at an impressive pace.
Both Pythran and Cython can be used directly inside jupyter notebooks which we will use here. However, they are really aimed at being used for compiling modules and that is how I usually use them.
Cython#
Lets start with Python. Our starting point is the Numpy code above
import cython # the cython module
%load_ext cython
%%cython
import numpy as np
cimport numpy as np
def make_decision_cy(signal, alphabet):
out = np.zeros_like(signal)
d = np.abs(signal[np.newaxis,:]-alphabet[:, np.newaxis])
idx = np.argmin(d, axis=0)
out = alphabet[idx]
return out
%timeit make_decision_cy(qam64[0], qam64.coded_symbols)
1.24 ms ± 15.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
OK that was rather dissappointing no speed-up compared to the numpy code. However if we read the Cython documentation we find out that we really should be annotating the code to get significant speed-ups.
The first thing we can do is turn off several of the checks that Python usually does
%%cython
import numpy as np
cimport numpy as np
import cython
@cython.boundscheck(False) # won't check that index is in bounds of array
@cython.wraparound(False) # array[-1] won't work
@cython.nonecheck(False) # variables are never set to None
@cython.cdivision(True) # don't protect against dividing by zero
def make_decision_cy(signal, alphabet):
out = np.zeros_like(signal)
d = np.abs(signal[np.newaxis,:]-alphabet[:, np.newaxis])
idx = np.argmin(d, axis=0)
out = alphabet[idx]
return out
%timeit make_decision_cy(qam64[0], qam64.coded_symbols)
1.25 ms ± 14 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
So again no increase in speed, but considering the code is compiled what if we add some compiler flags?
%%cython -c=-march=native -c=-O3 -c=-ffast-math -c=-mfpmath=sse -c=-fopenmp
import numpy as np
import numpy as np
cimport numpy as np
import cython
@cython.boundscheck(False) # won't check that index is in bounds of array
@cython.wraparound(False) # array[-1] won't work
@cython.nonecheck(False) # variables are never set to None
@cython.cdivision(True) # don't protect against dividing by zero
def make_decision_cy(signal, alphabet):
out = np.zeros_like(signal)
d = np.abs(signal[np.newaxis,:]-alphabet[:, np.newaxis])
idx = np.argmin(d, axis=0)
out = alphabet[idx]
return out
%timeit make_decision_cy(qam64[0], qam64.coded_symbols)
1.25 ms ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Again no speed up. But we haven’t told cython what numeric types we are dealing with so let us add some type annotations
%%cython -c=-march=native -c=-O3 -c=-ffast-math -c=-mfpmath=sse -c=-fopenmp
import numpy as np
cimport numpy as np
import cython
@cython.boundscheck(False) # won't check that index is in bounds of array
@cython.wraparound(False) # array[-1] won't work
@cython.nonecheck(False) # variables are never set to None
@cython.cdivision(True) # don't protect against dividing by zero
def make_decision_cy(np.ndarray[ndim=1, dtype=np.complex128_t] signal, np.ndarray[ndim=1, dtype=np.complex128_t] alphabet): # two array type annotations
cdef double complex[:] out # this is a memoryview which essentially tells cython this is a block of memory similar to the ndarray annotations above
cdef double[:,:] d
cdef long[:] idx
out = np.zeros_like(signal)
d = np.abs(signal[np.newaxis,:]-alphabet[:, np.newaxis])
idx = np.argmin(d, axis=0)
out = alphabet[idx]
return np.array(out)
timeit make_decision_cy(qam64[0], qam64.coded_symbols)
1.25 ms ± 17.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Again no speed improvement. What is going on?
Let us have a look at the code. Cython provides the --annotate argument, which tells us how the code is compiled
%%cython --annotate
import numpy as np
cimport numpy as np
import cython
@cython.boundscheck(False) # won't check that index is in bounds of array
@cython.wraparound(False) # array[-1] won't work
@cython.nonecheck(False) # variables are never set to None
@cython.cdivision(True) # don't protect against dividing by zero
def make_decision_cy(np.ndarray[ndim=1, dtype=np.complex128_t] signal, np.ndarray[ndim=1, dtype=np.complex128_t] alphabet): # two array type annotations
cdef double complex[:] out # this is a memoryview which essentially tells cython this is a block of memory similar to the ndarray annotations above
cdef double[:,:] d
cdef long[:] idx
out = np.zeros_like(signal)
d = np.abs(signal[np.newaxis,:]-alphabet[:, np.newaxis])
idx = np.argmin(d, axis=0)
out = alphabet[idx]
return np.array(out)
Generated by Cython 3.0.8
Yellow lines hint at Python interaction.
Click on a line that starts with a "+" to see the C code that Cython generated for it.
+01: import numpy as np
__pyx_t_7 = __Pyx_ImportDottedModule(__pyx_n_s_numpy, NULL); if (unlikely(!__pyx_t_7)) __PYX_ERR(1, 1, __pyx_L1_error) __Pyx_GOTREF(__pyx_t_7); if (PyDict_SetItem(__pyx_d, __pyx_n_s_np, __pyx_t_7) < 0) __PYX_ERR(1, 1, __pyx_L1_error) __Pyx_DECREF(__pyx_t_7); __pyx_t_7 = 0; /* … */ __pyx_t_7 = __Pyx_PyDict_NewPresized(0); if (unlikely(!__pyx_t_7)) __PYX_ERR(1, 1, __pyx_L1_error) __Pyx_GOTREF(__pyx_t_7); if (PyDict_SetItem(__pyx_d, __pyx_n_s_test, __pyx_t_7) < 0) __PYX_ERR(1, 1, __pyx_L1_error) __Pyx_DECREF(__pyx_t_7); __pyx_t_7 = 0;
02: cimport numpy as np
03: import cython
04:
+05: @cython.boundscheck(False) # won't check that index is in bounds of array
/* Python wrapper */
static PyObject *__pyx_pw_54_cython_magic_7806802036a512ad86f340b76490f6f1bae1db6a_1make_decision_cy(PyObject *__pyx_self,
#if CYTHON_METH_FASTCALL
PyObject *const *__pyx_args, Py_ssize_t __pyx_nargs, PyObject *__pyx_kwds
#else
PyObject *__pyx_args, PyObject *__pyx_kwds
#endif
); /*proto*/
static PyMethodDef __pyx_mdef_54_cython_magic_7806802036a512ad86f340b76490f6f1bae1db6a_1make_decision_cy = {"make_decision_cy", (PyCFunction)(void*)(__Pyx_PyCFunction_FastCallWithKeywords)__pyx_pw_54_cython_magic_7806802036a512ad86f340b76490f6f1bae1db6a_1make_decision_cy, __Pyx_METH_FASTCALL|METH_KEYWORDS, 0};
static PyObject *__pyx_pw_54_cython_magic_7806802036a512ad86f340b76490f6f1bae1db6a_1make_decision_cy(PyObject *__pyx_self,
#if CYTHON_METH_FASTCALL
PyObject *const *__pyx_args, Py_ssize_t __pyx_nargs, PyObject *__pyx_kwds
#else
PyObject *__pyx_args, PyObject *__pyx_kwds
#endif
) {
PyArrayObject *__pyx_v_signal = 0;
PyArrayObject *__pyx_v_alphabet = 0;
#if !CYTHON_METH_FASTCALL
CYTHON_UNUSED Py_ssize_t __pyx_nargs;
#endif
CYTHON_UNUSED PyObject *const *__pyx_kwvalues;
PyObject *__pyx_r = 0;
__Pyx_RefNannyDeclarations
__Pyx_RefNannySetupContext("make_decision_cy (wrapper)", 0);
#if !CYTHON_METH_FASTCALL
#if CYTHON_ASSUME_SAFE_MACROS
__pyx_nargs = PyTuple_GET_SIZE(__pyx_args);
#else
__pyx_nargs = PyTuple_Size(__pyx_args); if (unlikely(__pyx_nargs < 0)) return NULL;
#endif
#endif
__pyx_kwvalues = __Pyx_KwValues_FASTCALL(__pyx_args, __pyx_nargs);
{
PyObject **__pyx_pyargnames[] = {&__pyx_n_s_signal,&__pyx_n_s_alphabet,0};
PyObject* values[2] = {0,0};
if (__pyx_kwds) {
Py_ssize_t kw_args;
switch (__pyx_nargs) {
case 2: values[1] = __Pyx_Arg_FASTCALL(__pyx_args, 1);
CYTHON_FALLTHROUGH;
case 1: values[0] = __Pyx_Arg_FASTCALL(__pyx_args, 0);
CYTHON_FALLTHROUGH;
case 0: break;
default: goto __pyx_L5_argtuple_error;
}
kw_args = __Pyx_NumKwargs_FASTCALL(__pyx_kwds);
switch (__pyx_nargs) {
case 0:
if (likely((values[0] = __Pyx_GetKwValue_FASTCALL(__pyx_kwds, __pyx_kwvalues, __pyx_n_s_signal)) != 0)) {
(void)__Pyx_Arg_NewRef_FASTCALL(values[0]);
kw_args--;
}
else if (unlikely(PyErr_Occurred())) __PYX_ERR(1, 5, __pyx_L3_error)
else goto __pyx_L5_argtuple_error;
CYTHON_FALLTHROUGH;
case 1:
if (likely((values[1] = __Pyx_GetKwValue_FASTCALL(__pyx_kwds, __pyx_kwvalues, __pyx_n_s_alphabet)) != 0)) {
(void)__Pyx_Arg_NewRef_FASTCALL(values[1]);
kw_args--;
}
else if (unlikely(PyErr_Occurred())) __PYX_ERR(1, 5, __pyx_L3_error)
else {
__Pyx_RaiseArgtupleInvalid("make_decision_cy", 1, 2, 2, 1); __PYX_ERR(1, 5, __pyx_L3_error)
}
}
if (unlikely(kw_args > 0)) {
const Py_ssize_t kwd_pos_args = __pyx_nargs;
if (unlikely(__Pyx_ParseOptionalKeywords(__pyx_kwds, __pyx_kwvalues, __pyx_pyargnames, 0, values + 0, kwd_pos_args, "make_decision_cy") < 0)) __PYX_ERR(1, 5, __pyx_L3_error)
}
} else if (unlikely(__pyx_nargs != 2)) {
goto __pyx_L5_argtuple_error;
} else {
values[0] = __Pyx_Arg_FASTCALL(__pyx_args, 0);
values[1] = __Pyx_Arg_FASTCALL(__pyx_args, 1);
}
__pyx_v_signal = ((PyArrayObject *)values[0]);
__pyx_v_alphabet = ((PyArrayObject *)values[1]);
}
goto __pyx_L6_skip;
__pyx_L5_argtuple_error:;
__Pyx_RaiseArgtupleInvalid("make_decision_cy", 1, 2, 2, __pyx_nargs); __PYX_ERR(1, 5, __pyx_L3_error)
__pyx_L6_skip:;
goto __pyx_L4_argument_unpacking_done;
__pyx_L3_error:;
{
Py_ssize_t __pyx_temp;
for (__pyx_temp=0; __pyx_temp < (Py_ssize_t)(sizeof(values)/sizeof(values[0])); ++__pyx_temp) {
__Pyx_Arg_XDECREF_FASTCALL(values[__pyx_temp]);
}
}
__Pyx_AddTraceback("_cython_magic_7806802036a512ad86f340b76490f6f1bae1db6a.make_decision_cy", __pyx_clineno, __pyx_lineno, __pyx_filename);
__Pyx_RefNannyFinishContext();
return NULL;
__pyx_L4_argument_unpacking_done:;
if (unlikely(!__Pyx_ArgTypeTest(((PyObject *)__pyx_v_signal), __pyx_ptype_5numpy_ndarray, 1, "signal", 0))) __PYX_ERR(1, 9, __pyx_L1_error)
if (unlikely(!__Pyx_ArgTypeTest(((PyObject *)__pyx_v_alphabet), __pyx_ptype_5numpy_ndarray, 1, "alphabet", 0))) __PYX_ERR(1, 9, __pyx_L1_error)
__pyx_r = __pyx_pf_54_cython_magic_7806802036a512ad86f340b76490f6f1bae1db6a_make_decision_cy(__pyx_self, __pyx_v_signal, __pyx_v_alphabet);
int __pyx_lineno = 0;
const char *__pyx_filename = NULL;
int __pyx_clineno = 0;
/* function exit code */
goto __pyx_L0;
__pyx_L1_error:;
__pyx_r = NULL;
__pyx_L0:;
{
Py_ssize_t __pyx_temp;
for (__pyx_temp=0; __pyx_temp < (Py_ssize_t)(sizeof(values)/sizeof(values[0])); ++__pyx_temp) {
__Pyx_Arg_XDECREF_FASTCALL(values[__pyx_temp]);
}
}
__Pyx_RefNannyFinishContext();
return __pyx_r;
}
static PyObject *__pyx_pf_54_cython_magic_7806802036a512ad86f340b76490f6f1bae1db6a_make_decision_cy(CYTHON_UNUSED PyObject *__pyx_self, PyArrayObject *__pyx_v_signal, PyArrayObject *__pyx_v_alphabet) {
__Pyx_memviewslice __pyx_v_out = { 0, 0, { 0 }, { 0 }, { 0 } };
__Pyx_memviewslice __pyx_v_d = { 0, 0, { 0 }, { 0 }, { 0 } };
__Pyx_memviewslice __pyx_v_idx = { 0, 0, { 0 }, { 0 }, { 0 } };
__Pyx_LocalBuf_ND __pyx_pybuffernd_alphabet;
__Pyx_Buffer __pyx_pybuffer_alphabet;
__Pyx_LocalBuf_ND __pyx_pybuffernd_signal;
__Pyx_Buffer __pyx_pybuffer_signal;
PyObject *__pyx_r = NULL;
__pyx_pybuffer_signal.pybuffer.buf = NULL;
__pyx_pybuffer_signal.refcount = 0;
__pyx_pybuffernd_signal.data = NULL;
__pyx_pybuffernd_signal.rcbuffer = &__pyx_pybuffer_signal;
__pyx_pybuffer_alphabet.pybuffer.buf = NULL;
__pyx_pybuffer_alphabet.refcount = 0;
__pyx_pybuffernd_alphabet.data = NULL;
__pyx_pybuffernd_alphabet.rcbuffer = &__pyx_pybuffer_alphabet;
{
__Pyx_BufFmt_StackElem __pyx_stack[1];
if (unlikely(__Pyx_GetBufferAndValidate(&__pyx_pybuffernd_signal.rcbuffer->pybuffer, (PyObject*)__pyx_v_signal, &__Pyx_TypeInfo___pyx_t_double_complex, PyBUF_FORMAT| PyBUF_STRIDES, 1, 0, __pyx_stack) == -1)) __PYX_ERR(1, 5, __pyx_L1_error)
}
__pyx_pybuffernd_signal.diminfo[0].strides = __pyx_pybuffernd_signal.rcbuffer->pybuffer.strides[0]; __pyx_pybuffernd_signal.diminfo[0].shape = __pyx_pybuffernd_signal.rcbuffer->pybuffer.shape[0];
{
__Pyx_BufFmt_StackElem __pyx_stack[1];
if (unlikely(__Pyx_GetBufferAndValidate(&__pyx_pybuffernd_alphabet.rcbuffer->pybuffer, (PyObject*)__pyx_v_alphabet, &__Pyx_TypeInfo___pyx_t_double_complex, PyBUF_FORMAT| PyBUF_STRIDES, 1, 0, __pyx_stack) == -1)) __PYX_ERR(1, 5, __pyx_L1_error)
}
__pyx_pybuffernd_alphabet.diminfo[0].strides = __pyx_pybuffernd_alphabet.rcbuffer->pybuffer.strides[0]; __pyx_pybuffernd_alphabet.diminfo[0].shape = __pyx_pybuffernd_alphabet.rcbuffer->pybuffer.shape[0];
/* … */
/* function exit code */
__pyx_L1_error:;
__Pyx_XDECREF(__pyx_t_1);
__Pyx_XDECREF(__pyx_t_2);
__Pyx_XDECREF(__pyx_t_3);
__PYX_XCLEAR_MEMVIEW(&__pyx_t_5, 1);
__Pyx_XDECREF(__pyx_t_6);
__Pyx_XDECREF(__pyx_t_7);
__PYX_XCLEAR_MEMVIEW(&__pyx_t_8, 1);
__PYX_XCLEAR_MEMVIEW(&__pyx_t_9, 1);
{ PyObject *__pyx_type, *__pyx_value, *__pyx_tb;
__Pyx_PyThreadState_declare
__Pyx_PyThreadState_assign
__Pyx_ErrFetch(&__pyx_type, &__pyx_value, &__pyx_tb);
__Pyx_SafeReleaseBuffer(&__pyx_pybuffernd_alphabet.rcbuffer->pybuffer);
__Pyx_SafeReleaseBuffer(&__pyx_pybuffernd_signal.rcbuffer->pybuffer);
__Pyx_ErrRestore(__pyx_type, __pyx_value, __pyx_tb);}
__Pyx_AddTraceback("_cython_magic_7806802036a512ad86f340b76490f6f1bae1db6a.make_decision_cy", __pyx_clineno, __pyx_lineno, __pyx_filename);
__pyx_r = NULL;
goto __pyx_L2;
__pyx_L0:;
__Pyx_SafeReleaseBuffer(&__pyx_pybuffernd_alphabet.rcbuffer->pybuffer);
__Pyx_SafeReleaseBuffer(&__pyx_pybuffernd_signal.rcbuffer->pybuffer);
__pyx_L2:;
__PYX_XCLEAR_MEMVIEW(&__pyx_v_out, 1);
__PYX_XCLEAR_MEMVIEW(&__pyx_v_d, 1);
__PYX_XCLEAR_MEMVIEW(&__pyx_v_idx, 1);
__Pyx_XGIVEREF(__pyx_r);
__Pyx_RefNannyFinishContext();
return __pyx_r;
}
/* … */
__pyx_tuple__22 = PyTuple_Pack(5, __pyx_n_s_signal, __pyx_n_s_alphabet, __pyx_n_s_out, __pyx_n_s_d, __pyx_n_s_idx); if (unlikely(!__pyx_tuple__22)) __PYX_ERR(1, 5, __pyx_L1_error)
__Pyx_GOTREF(__pyx_tuple__22);
__Pyx_GIVEREF(__pyx_tuple__22);
/* … */
__pyx_t_7 = __Pyx_CyFunction_New(&__pyx_mdef_54_cython_magic_7806802036a512ad86f340b76490f6f1bae1db6a_1make_decision_cy, 0, __pyx_n_s_make_decision_cy, NULL, __pyx_n_s_cython_magic_7806802036a512ad86, __pyx_d, ((PyObject *)__pyx_codeobj__23)); if (unlikely(!__pyx_t_7)) __PYX_ERR(1, 5, __pyx_L1_error)
__Pyx_GOTREF(__pyx_t_7);
if (PyDict_SetItem(__pyx_d, __pyx_n_s_make_decision_cy, __pyx_t_7) < 0) __PYX_ERR(1, 5, __pyx_L1_error)
__Pyx_DECREF(__pyx_t_7); __pyx_t_7 = 0;
06: @cython.wraparound(False) # array[-1] won't work
07: @cython.nonecheck(False) # variables are never set to None
08: @cython.cdivision(True) # don't protect against dividing by zero
09: def make_decision_cy(np.ndarray[ndim=1, dtype=np.complex128_t] signal, np.ndarray[ndim=1, dtype=np.complex128_t] alphabet): # two array type annotations
10: cdef double complex[:] out # this is a memoryview which essentially tells cython this is a block of memory similar to the ndarray annotations above
11: cdef double[:,:] d
12: cdef long[:] idx
+13: out = np.zeros_like(signal)
__Pyx_GetModuleGlobalName(__pyx_t_2, __pyx_n_s_np); if (unlikely(!__pyx_t_2)) __PYX_ERR(1, 13, __pyx_L1_error) __Pyx_GOTREF(__pyx_t_2); __pyx_t_3 = __Pyx_PyObject_GetAttrStr(__pyx_t_2, __pyx_n_s_zeros_like); if (unlikely(!__pyx_t_3)) __PYX_ERR(1, 13, __pyx_L1_error) __Pyx_GOTREF(__pyx_t_3); __Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0; __pyx_t_2 = NULL; __pyx_t_4 = 0; #if CYTHON_UNPACK_METHODS if (unlikely(PyMethod_Check(__pyx_t_3))) { __pyx_t_2 = PyMethod_GET_SELF(__pyx_t_3); if (likely(__pyx_t_2)) { PyObject* function = PyMethod_GET_FUNCTION(__pyx_t_3); __Pyx_INCREF(__pyx_t_2); __Pyx_INCREF(function); __Pyx_DECREF_SET(__pyx_t_3, function); __pyx_t_4 = 1; } } #endif { PyObject *__pyx_callargs[2] = {__pyx_t_2, ((PyObject *)__pyx_v_signal)}; __pyx_t_1 = __Pyx_PyObject_FastCall(__pyx_t_3, __pyx_callargs+1-__pyx_t_4, 1+__pyx_t_4); __Pyx_XDECREF(__pyx_t_2); __pyx_t_2 = 0; if (unlikely(!__pyx_t_1)) __PYX_ERR(1, 13, __pyx_L1_error) __Pyx_GOTREF(__pyx_t_1); __Pyx_DECREF(__pyx_t_3); __pyx_t_3 = 0; } __pyx_t_5 = __Pyx_PyObject_to_MemoryviewSlice_ds___pyx_t_double_complex(__pyx_t_1, PyBUF_WRITABLE); if (unlikely(!__pyx_t_5.memview)) __PYX_ERR(1, 13, __pyx_L1_error) __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0; __pyx_v_out = __pyx_t_5; __pyx_t_5.memview = NULL; __pyx_t_5.data = NULL;
+14: d = np.abs(signal[np.newaxis,:]-alphabet[:, np.newaxis])
__Pyx_GetModuleGlobalName(__pyx_t_3, __pyx_n_s_np); if (unlikely(!__pyx_t_3)) __PYX_ERR(1, 14, __pyx_L1_error) __Pyx_GOTREF(__pyx_t_3); __pyx_t_2 = __Pyx_PyObject_GetAttrStr(__pyx_t_3, __pyx_n_s_abs); if (unlikely(!__pyx_t_2)) __PYX_ERR(1, 14, __pyx_L1_error) __Pyx_GOTREF(__pyx_t_2); __Pyx_DECREF(__pyx_t_3); __pyx_t_3 = 0; __Pyx_GetModuleGlobalName(__pyx_t_3, __pyx_n_s_np); if (unlikely(!__pyx_t_3)) __PYX_ERR(1, 14, __pyx_L1_error) __Pyx_GOTREF(__pyx_t_3); __pyx_t_6 = __Pyx_PyObject_GetAttrStr(__pyx_t_3, __pyx_n_s_newaxis); if (unlikely(!__pyx_t_6)) __PYX_ERR(1, 14, __pyx_L1_error) __Pyx_GOTREF(__pyx_t_6); __Pyx_DECREF(__pyx_t_3); __pyx_t_3 = 0; __pyx_t_3 = PyTuple_New(2); if (unlikely(!__pyx_t_3)) __PYX_ERR(1, 14, __pyx_L1_error) __Pyx_GOTREF(__pyx_t_3); __Pyx_GIVEREF(__pyx_t_6); if (__Pyx_PyTuple_SET_ITEM(__pyx_t_3, 0, __pyx_t_6)) __PYX_ERR(1, 14, __pyx_L1_error); __Pyx_INCREF(__pyx_slice__5); __Pyx_GIVEREF(__pyx_slice__5); if (__Pyx_PyTuple_SET_ITEM(__pyx_t_3, 1, __pyx_slice__5)) __PYX_ERR(1, 14, __pyx_L1_error); __pyx_t_6 = 0; __pyx_t_6 = __Pyx_PyObject_GetItem(((PyObject *)__pyx_v_signal), __pyx_t_3); if (unlikely(!__pyx_t_6)) __PYX_ERR(1, 14, __pyx_L1_error) __Pyx_GOTREF(__pyx_t_6); __Pyx_DECREF(__pyx_t_3); __pyx_t_3 = 0; __Pyx_GetModuleGlobalName(__pyx_t_3, __pyx_n_s_np); if (unlikely(!__pyx_t_3)) __PYX_ERR(1, 14, __pyx_L1_error) __Pyx_GOTREF(__pyx_t_3); __pyx_t_7 = __Pyx_PyObject_GetAttrStr(__pyx_t_3, __pyx_n_s_newaxis); if (unlikely(!__pyx_t_7)) __PYX_ERR(1, 14, __pyx_L1_error) __Pyx_GOTREF(__pyx_t_7); __Pyx_DECREF(__pyx_t_3); __pyx_t_3 = 0; __pyx_t_3 = PyTuple_New(2); if (unlikely(!__pyx_t_3)) __PYX_ERR(1, 14, __pyx_L1_error) __Pyx_GOTREF(__pyx_t_3); __Pyx_INCREF(__pyx_slice__5); __Pyx_GIVEREF(__pyx_slice__5); if (__Pyx_PyTuple_SET_ITEM(__pyx_t_3, 0, __pyx_slice__5)) __PYX_ERR(1, 14, __pyx_L1_error); __Pyx_GIVEREF(__pyx_t_7); if (__Pyx_PyTuple_SET_ITEM(__pyx_t_3, 1, __pyx_t_7)) __PYX_ERR(1, 14, __pyx_L1_error); __pyx_t_7 = 0; __pyx_t_7 = __Pyx_PyObject_GetItem(((PyObject *)__pyx_v_alphabet), __pyx_t_3); if (unlikely(!__pyx_t_7)) __PYX_ERR(1, 14, __pyx_L1_error) __Pyx_GOTREF(__pyx_t_7); __Pyx_DECREF(__pyx_t_3); __pyx_t_3 = 0; __pyx_t_3 = PyNumber_Subtract(__pyx_t_6, __pyx_t_7); if (unlikely(!__pyx_t_3)) __PYX_ERR(1, 14, __pyx_L1_error) __Pyx_GOTREF(__pyx_t_3); __Pyx_DECREF(__pyx_t_6); __pyx_t_6 = 0; __Pyx_DECREF(__pyx_t_7); __pyx_t_7 = 0; __pyx_t_7 = NULL; __pyx_t_4 = 0; #if CYTHON_UNPACK_METHODS if (unlikely(PyMethod_Check(__pyx_t_2))) { __pyx_t_7 = PyMethod_GET_SELF(__pyx_t_2); if (likely(__pyx_t_7)) { PyObject* function = PyMethod_GET_FUNCTION(__pyx_t_2); __Pyx_INCREF(__pyx_t_7); __Pyx_INCREF(function); __Pyx_DECREF_SET(__pyx_t_2, function); __pyx_t_4 = 1; } } #endif { PyObject *__pyx_callargs[2] = {__pyx_t_7, __pyx_t_3}; __pyx_t_1 = __Pyx_PyObject_FastCall(__pyx_t_2, __pyx_callargs+1-__pyx_t_4, 1+__pyx_t_4); __Pyx_XDECREF(__pyx_t_7); __pyx_t_7 = 0; __Pyx_DECREF(__pyx_t_3); __pyx_t_3 = 0; if (unlikely(!__pyx_t_1)) __PYX_ERR(1, 14, __pyx_L1_error) __Pyx_GOTREF(__pyx_t_1); __Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0; } __pyx_t_8 = __Pyx_PyObject_to_MemoryviewSlice_dsds_double(__pyx_t_1, PyBUF_WRITABLE); if (unlikely(!__pyx_t_8.memview)) __PYX_ERR(1, 14, __pyx_L1_error) __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0; __pyx_v_d = __pyx_t_8; __pyx_t_8.memview = NULL; __pyx_t_8.data = NULL;
+15: idx = np.argmin(d, axis=0)
__Pyx_GetModuleGlobalName(__pyx_t_1, __pyx_n_s_np); if (unlikely(!__pyx_t_1)) __PYX_ERR(1, 15, __pyx_L1_error) __Pyx_GOTREF(__pyx_t_1); __pyx_t_2 = __Pyx_PyObject_GetAttrStr(__pyx_t_1, __pyx_n_s_argmin); if (unlikely(!__pyx_t_2)) __PYX_ERR(1, 15, __pyx_L1_error) __Pyx_GOTREF(__pyx_t_2); __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0; __pyx_t_1 = __pyx_memoryview_fromslice(__pyx_v_d, 2, (PyObject *(*)(char *)) __pyx_memview_get_double, (int (*)(char *, PyObject *)) __pyx_memview_set_double, 0);; if (unlikely(!__pyx_t_1)) __PYX_ERR(1, 15, __pyx_L1_error) __Pyx_GOTREF(__pyx_t_1); __pyx_t_3 = PyTuple_New(1); if (unlikely(!__pyx_t_3)) __PYX_ERR(1, 15, __pyx_L1_error) __Pyx_GOTREF(__pyx_t_3); __Pyx_GIVEREF(__pyx_t_1); if (__Pyx_PyTuple_SET_ITEM(__pyx_t_3, 0, __pyx_t_1)) __PYX_ERR(1, 15, __pyx_L1_error); __pyx_t_1 = 0; __pyx_t_1 = __Pyx_PyDict_NewPresized(1); if (unlikely(!__pyx_t_1)) __PYX_ERR(1, 15, __pyx_L1_error) __Pyx_GOTREF(__pyx_t_1); if (PyDict_SetItem(__pyx_t_1, __pyx_n_s_axis, __pyx_int_0) < 0) __PYX_ERR(1, 15, __pyx_L1_error) __pyx_t_7 = __Pyx_PyObject_Call(__pyx_t_2, __pyx_t_3, __pyx_t_1); if (unlikely(!__pyx_t_7)) __PYX_ERR(1, 15, __pyx_L1_error) __Pyx_GOTREF(__pyx_t_7); __Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0; __Pyx_DECREF(__pyx_t_3); __pyx_t_3 = 0; __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0; __pyx_t_9 = __Pyx_PyObject_to_MemoryviewSlice_ds_long(__pyx_t_7, PyBUF_WRITABLE); if (unlikely(!__pyx_t_9.memview)) __PYX_ERR(1, 15, __pyx_L1_error) __Pyx_DECREF(__pyx_t_7); __pyx_t_7 = 0; __pyx_v_idx = __pyx_t_9; __pyx_t_9.memview = NULL; __pyx_t_9.data = NULL;
+16: out = alphabet[idx]
__pyx_t_7 = __pyx_memoryview_fromslice(__pyx_v_idx, 1, (PyObject *(*)(char *)) __pyx_memview_get_long, (int (*)(char *, PyObject *)) __pyx_memview_set_long, 0);; if (unlikely(!__pyx_t_7)) __PYX_ERR(1, 16, __pyx_L1_error) __Pyx_GOTREF(__pyx_t_7); __pyx_t_1 = __Pyx_PyObject_GetItem(((PyObject *)__pyx_v_alphabet), __pyx_t_7); if (unlikely(!__pyx_t_1)) __PYX_ERR(1, 16, __pyx_L1_error) __Pyx_GOTREF(__pyx_t_1); __Pyx_DECREF(__pyx_t_7); __pyx_t_7 = 0; __pyx_t_5 = __Pyx_PyObject_to_MemoryviewSlice_ds___pyx_t_double_complex(__pyx_t_1, PyBUF_WRITABLE); if (unlikely(!__pyx_t_5.memview)) __PYX_ERR(1, 16, __pyx_L1_error) __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0; __PYX_XCLEAR_MEMVIEW(&__pyx_v_out, 1); __pyx_v_out = __pyx_t_5; __pyx_t_5.memview = NULL; __pyx_t_5.data = NULL;
+17: return np.array(out)
__Pyx_XDECREF(__pyx_r); __Pyx_GetModuleGlobalName(__pyx_t_7, __pyx_n_s_np); if (unlikely(!__pyx_t_7)) __PYX_ERR(1, 17, __pyx_L1_error) __Pyx_GOTREF(__pyx_t_7); __pyx_t_3 = __Pyx_PyObject_GetAttrStr(__pyx_t_7, __pyx_n_s_array); if (unlikely(!__pyx_t_3)) __PYX_ERR(1, 17, __pyx_L1_error) __Pyx_GOTREF(__pyx_t_3); __Pyx_DECREF(__pyx_t_7); __pyx_t_7 = 0; __pyx_t_7 = __pyx_memoryview_fromslice(__pyx_v_out, 1, (PyObject *(*)(char *)) __pyx_memview_get___pyx_t_double_complex, (int (*)(char *, PyObject *)) __pyx_memview_set___pyx_t_double_complex, 0);; if (unlikely(!__pyx_t_7)) __PYX_ERR(1, 17, __pyx_L1_error) __Pyx_GOTREF(__pyx_t_7); __pyx_t_2 = NULL; __pyx_t_4 = 0; #if CYTHON_UNPACK_METHODS if (unlikely(PyMethod_Check(__pyx_t_3))) { __pyx_t_2 = PyMethod_GET_SELF(__pyx_t_3); if (likely(__pyx_t_2)) { PyObject* function = PyMethod_GET_FUNCTION(__pyx_t_3); __Pyx_INCREF(__pyx_t_2); __Pyx_INCREF(function); __Pyx_DECREF_SET(__pyx_t_3, function); __pyx_t_4 = 1; } } #endif { PyObject *__pyx_callargs[2] = {__pyx_t_2, __pyx_t_7}; __pyx_t_1 = __Pyx_PyObject_FastCall(__pyx_t_3, __pyx_callargs+1-__pyx_t_4, 1+__pyx_t_4); __Pyx_XDECREF(__pyx_t_2); __pyx_t_2 = 0; __Pyx_DECREF(__pyx_t_7); __pyx_t_7 = 0; if (unlikely(!__pyx_t_1)) __PYX_ERR(1, 17, __pyx_L1_error) __Pyx_GOTREF(__pyx_t_1); __Pyx_DECREF(__pyx_t_3); __pyx_t_3 = 0; } __pyx_r = __pyx_t_1; __pyx_t_1 = 0; goto __pyx_L0;
Yellow above indicates access to Python internals (which are slow). So there is our reason, we are still just using python (numpy) functions. So to speed this up, we have to exchange the numpy functions with C functions. Maybe it’s better if we start from the Python function?
%%cython -c=-march=native -c=-O3 -c=-ffast-math -c=-mfpmath=sse -c=-fopenmp
#-c=-march=native -c=-O3 -c=-ffast-math -c=-mfpmath=sse -c=-fopenmp
import numpy as np
cimport numpy as np
import cython
@cython.boundscheck(False) # won't check that index is in bounds of array
@cython.wraparound(False) # array[-1] won't work
@cython.nonecheck(False) # variables are never set to None
@cython.cdivision(True) # don't protect against dividing by zero
def make_decision_cy(np.ndarray[ndim=1, dtype=np.complex128_t] signal, np.ndarray[ndim=1, dtype=np.complex128_t] alphabet):
cdef double complex[:] out
cdef int i, j, idx, N, M
cdef double d, dmin
N = signal.size
M = alphabet.size
out = np.zeros_like(signal)
for i in range(N):
dmin = 1000.
idx = 0
for j in range(M):
d = abs(signal[i]-alphabet[j])
if d < dmin:
idx = j
dmin = d
out[i] = idx
return np.array(out)
timeit make_decision_cy(qam64[0], qam64.coded_symbols)
1.46 ms ± 1.61 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Finally a speed improvement! Not a lot though. The reason is that we are still calling a Python function inside the loop (abs). Let us use the C-function instead
%%cython -c=-march=native -c=-O3 -c=-ffast-math -c=-mfpmath=sse -c=-fopenmp
import numpy as np
cimport numpy as np
cimport cython
cdef extern from "complex.h" nogil:
double cabs(double complex)
@cython.boundscheck(False) # won't check that index is in bounds of array
@cython.wraparound(False) # array[-1] won't work
@cython.nonecheck(False) # variables are never set to None
@cython.cdivision(True) # don't protect against dividing by zero
def make_decision_cy(double complex[:] signal, double complex[:] alphabet):
cdef double complex[:] out
cdef int i, j, idx, N, M
cdef double d, dmin
N = signal.size
M = alphabet.size
out = np.zeros_like(signal)
for i in range(N):
dmin = 1000.
idx = 0
for j in range(M):
d = cabs(signal[i]-alphabet[j])
if d < dmin:
idx = j
dmin = d
out[i] = idx
return np.array(out)
timeit make_decision_cy(qam64[0], qam64.coded_symbols)
1.44 ms ± 773 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
OK now we have an interesting speed improvement another factor >10 compared to the numpy code. We can actually get even more by using parallel processing. Cython loops that do not use python code can release the GIL
%%cython -c=-march=native -c=-O3 -c=-ffast-math -c=-mfpmath=sse -c=-fopenmp --link-args=-fopenmp
import numpy as np
cimport numpy as np
cimport cython
from cython.parallel import prange
cdef extern from "complex.h" nogil:
double cabs(double complex)
@cython.boundscheck(False) # won't check that index is in bounds of array
@cython.wraparound(False) # array[-1] won't work
@cython.nonecheck(False) # variables are never set to None
@cython.cdivision(True) # don't protect against dividing by zero
def make_decision_cy(double complex[:] signal, double complex[:] alphabet):
cdef double complex[:] out
cdef int i, j, idx, N, M
cdef double d, dmin
N = signal.size
M = alphabet.size
out = np.zeros_like(signal)
for i in prange(N, nogil=True):
dmin = 1000.
idx = 0
for j in range(M):
d = cabs(signal[i]-alphabet[j])
if d < dmin:
idx = j
dmin = d
out[i] = idx
return np.array(out)
timeit make_decision_cy(qam64[0], qam64.coded_symbols)
374 µs ± 13 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Another speed up almost a factor of 100 compared to the numpy code! Let us check what the annotated code looks like now
%%cython --annotate
import numpy as np
cimport numpy as np
cimport cython
from cython.parallel import prange
cdef extern from "complex.h" nogil:
double cabs(double complex)
@cython.boundscheck(False) # won't check that index is in bounds of array
@cython.wraparound(False) # array[-1] won't work
@cython.nonecheck(False) # variables are never set to None
@cython.cdivision(True) # don't protect against dividing by zero
def make_decision_cy(double complex[:] signal, double complex[:] alphabet):
cdef double complex[:] out
cdef int i, j, idx, N, M
cdef double d, dmin
N = signal.size
M = alphabet.size
out = np.zeros_like(signal)
for i in prange(N, nogil=True):
dmin = 1000.
idx = 0
for j in range(M):
d = cabs(signal[i]-alphabet[j])
if d < dmin:
idx = j
dmin = d
out[i] = idx
return np.array(out)
Generated by Cython 3.0.8
Yellow lines hint at Python interaction.
Click on a line that starts with a "+" to see the C code that Cython generated for it.
+01: import numpy as np
__pyx_t_7 = __Pyx_ImportDottedModule(__pyx_n_s_numpy, NULL); if (unlikely(!__pyx_t_7)) __PYX_ERR(0, 1, __pyx_L1_error) __Pyx_GOTREF(__pyx_t_7); if (PyDict_SetItem(__pyx_d, __pyx_n_s_np, __pyx_t_7) < 0) __PYX_ERR(0, 1, __pyx_L1_error) __Pyx_DECREF(__pyx_t_7); __pyx_t_7 = 0; /* … */ __pyx_t_7 = __Pyx_PyDict_NewPresized(0); if (unlikely(!__pyx_t_7)) __PYX_ERR(0, 1, __pyx_L1_error) __Pyx_GOTREF(__pyx_t_7); if (PyDict_SetItem(__pyx_d, __pyx_n_s_test, __pyx_t_7) < 0) __PYX_ERR(0, 1, __pyx_L1_error) __Pyx_DECREF(__pyx_t_7); __pyx_t_7 = 0;
02: cimport numpy as np
03: cimport cython
04: from cython.parallel import prange
05:
06: cdef extern from "complex.h" nogil:
07: double cabs(double complex)
08:
09:
+10: @cython.boundscheck(False) # won't check that index is in bounds of array
/* Python wrapper */
static PyObject *__pyx_pw_54_cython_magic_31fd376cc9d559b73513868b53e1b2ba887cac13_1make_decision_cy(PyObject *__pyx_self,
#if CYTHON_METH_FASTCALL
PyObject *const *__pyx_args, Py_ssize_t __pyx_nargs, PyObject *__pyx_kwds
#else
PyObject *__pyx_args, PyObject *__pyx_kwds
#endif
); /*proto*/
static PyMethodDef __pyx_mdef_54_cython_magic_31fd376cc9d559b73513868b53e1b2ba887cac13_1make_decision_cy = {"make_decision_cy", (PyCFunction)(void*)(__Pyx_PyCFunction_FastCallWithKeywords)__pyx_pw_54_cython_magic_31fd376cc9d559b73513868b53e1b2ba887cac13_1make_decision_cy, __Pyx_METH_FASTCALL|METH_KEYWORDS, 0};
static PyObject *__pyx_pw_54_cython_magic_31fd376cc9d559b73513868b53e1b2ba887cac13_1make_decision_cy(PyObject *__pyx_self,
#if CYTHON_METH_FASTCALL
PyObject *const *__pyx_args, Py_ssize_t __pyx_nargs, PyObject *__pyx_kwds
#else
PyObject *__pyx_args, PyObject *__pyx_kwds
#endif
) {
__Pyx_memviewslice __pyx_v_signal = { 0, 0, { 0 }, { 0 }, { 0 } };
__Pyx_memviewslice __pyx_v_alphabet = { 0, 0, { 0 }, { 0 }, { 0 } };
#if !CYTHON_METH_FASTCALL
CYTHON_UNUSED Py_ssize_t __pyx_nargs;
#endif
CYTHON_UNUSED PyObject *const *__pyx_kwvalues;
PyObject *__pyx_r = 0;
__Pyx_RefNannyDeclarations
__Pyx_RefNannySetupContext("make_decision_cy (wrapper)", 0);
#if !CYTHON_METH_FASTCALL
#if CYTHON_ASSUME_SAFE_MACROS
__pyx_nargs = PyTuple_GET_SIZE(__pyx_args);
#else
__pyx_nargs = PyTuple_Size(__pyx_args); if (unlikely(__pyx_nargs < 0)) return NULL;
#endif
#endif
__pyx_kwvalues = __Pyx_KwValues_FASTCALL(__pyx_args, __pyx_nargs);
{
PyObject **__pyx_pyargnames[] = {&__pyx_n_s_signal,&__pyx_n_s_alphabet,0};
PyObject* values[2] = {0,0};
if (__pyx_kwds) {
Py_ssize_t kw_args;
switch (__pyx_nargs) {
case 2: values[1] = __Pyx_Arg_FASTCALL(__pyx_args, 1);
CYTHON_FALLTHROUGH;
case 1: values[0] = __Pyx_Arg_FASTCALL(__pyx_args, 0);
CYTHON_FALLTHROUGH;
case 0: break;
default: goto __pyx_L5_argtuple_error;
}
kw_args = __Pyx_NumKwargs_FASTCALL(__pyx_kwds);
switch (__pyx_nargs) {
case 0:
if (likely((values[0] = __Pyx_GetKwValue_FASTCALL(__pyx_kwds, __pyx_kwvalues, __pyx_n_s_signal)) != 0)) {
(void)__Pyx_Arg_NewRef_FASTCALL(values[0]);
kw_args--;
}
else if (unlikely(PyErr_Occurred())) __PYX_ERR(0, 10, __pyx_L3_error)
else goto __pyx_L5_argtuple_error;
CYTHON_FALLTHROUGH;
case 1:
if (likely((values[1] = __Pyx_GetKwValue_FASTCALL(__pyx_kwds, __pyx_kwvalues, __pyx_n_s_alphabet)) != 0)) {
(void)__Pyx_Arg_NewRef_FASTCALL(values[1]);
kw_args--;
}
else if (unlikely(PyErr_Occurred())) __PYX_ERR(0, 10, __pyx_L3_error)
else {
__Pyx_RaiseArgtupleInvalid("make_decision_cy", 1, 2, 2, 1); __PYX_ERR(0, 10, __pyx_L3_error)
}
}
if (unlikely(kw_args > 0)) {
const Py_ssize_t kwd_pos_args = __pyx_nargs;
if (unlikely(__Pyx_ParseOptionalKeywords(__pyx_kwds, __pyx_kwvalues, __pyx_pyargnames, 0, values + 0, kwd_pos_args, "make_decision_cy") < 0)) __PYX_ERR(0, 10, __pyx_L3_error)
}
} else if (unlikely(__pyx_nargs != 2)) {
goto __pyx_L5_argtuple_error;
} else {
values[0] = __Pyx_Arg_FASTCALL(__pyx_args, 0);
values[1] = __Pyx_Arg_FASTCALL(__pyx_args, 1);
}
__pyx_v_signal = __Pyx_PyObject_to_MemoryviewSlice_ds___pyx_t_double_complex(values[0], PyBUF_WRITABLE); if (unlikely(!__pyx_v_signal.memview)) __PYX_ERR(0, 14, __pyx_L3_error)
__pyx_v_alphabet = __Pyx_PyObject_to_MemoryviewSlice_ds___pyx_t_double_complex(values[1], PyBUF_WRITABLE); if (unlikely(!__pyx_v_alphabet.memview)) __PYX_ERR(0, 14, __pyx_L3_error)
}
goto __pyx_L6_skip;
__pyx_L5_argtuple_error:;
__Pyx_RaiseArgtupleInvalid("make_decision_cy", 1, 2, 2, __pyx_nargs); __PYX_ERR(0, 10, __pyx_L3_error)
__pyx_L6_skip:;
goto __pyx_L4_argument_unpacking_done;
__pyx_L3_error:;
{
Py_ssize_t __pyx_temp;
for (__pyx_temp=0; __pyx_temp < (Py_ssize_t)(sizeof(values)/sizeof(values[0])); ++__pyx_temp) {
__Pyx_Arg_XDECREF_FASTCALL(values[__pyx_temp]);
}
}
__PYX_XCLEAR_MEMVIEW(&__pyx_v_signal, 1);
__PYX_XCLEAR_MEMVIEW(&__pyx_v_alphabet, 1);
__Pyx_AddTraceback("_cython_magic_31fd376cc9d559b73513868b53e1b2ba887cac13.make_decision_cy", __pyx_clineno, __pyx_lineno, __pyx_filename);
__Pyx_RefNannyFinishContext();
return NULL;
__pyx_L4_argument_unpacking_done:;
__pyx_r = __pyx_pf_54_cython_magic_31fd376cc9d559b73513868b53e1b2ba887cac13_make_decision_cy(__pyx_self, __pyx_v_signal, __pyx_v_alphabet);
int __pyx_lineno = 0;
const char *__pyx_filename = NULL;
int __pyx_clineno = 0;
/* function exit code */
__PYX_XCLEAR_MEMVIEW(&__pyx_v_signal, 1);
__PYX_XCLEAR_MEMVIEW(&__pyx_v_alphabet, 1);
{
Py_ssize_t __pyx_temp;
for (__pyx_temp=0; __pyx_temp < (Py_ssize_t)(sizeof(values)/sizeof(values[0])); ++__pyx_temp) {
__Pyx_Arg_XDECREF_FASTCALL(values[__pyx_temp]);
}
}
__Pyx_RefNannyFinishContext();
return __pyx_r;
}
static PyObject *__pyx_pf_54_cython_magic_31fd376cc9d559b73513868b53e1b2ba887cac13_make_decision_cy(CYTHON_UNUSED PyObject *__pyx_self, __Pyx_memviewslice __pyx_v_signal, __Pyx_memviewslice __pyx_v_alphabet) {
__Pyx_memviewslice __pyx_v_out = { 0, 0, { 0 }, { 0 }, { 0 } };
int __pyx_v_i;
int __pyx_v_j;
int __pyx_v_idx;
CYTHON_UNUSED int __pyx_v_N;
int __pyx_v_M;
double __pyx_v_d;
double __pyx_v_dmin;
PyObject *__pyx_r = NULL;
/* … */
/* function exit code */
__pyx_L1_error:;
__Pyx_XDECREF(__pyx_t_1);
__Pyx_XDECREF(__pyx_t_2);
__Pyx_XDECREF(__pyx_t_4);
__Pyx_XDECREF(__pyx_t_5);
__PYX_XCLEAR_MEMVIEW(&__pyx_t_6, 1);
__Pyx_AddTraceback("_cython_magic_31fd376cc9d559b73513868b53e1b2ba887cac13.make_decision_cy", __pyx_clineno, __pyx_lineno, __pyx_filename);
__pyx_r = NULL;
__pyx_L0:;
__PYX_XCLEAR_MEMVIEW(&__pyx_v_out, 1);
__Pyx_XGIVEREF(__pyx_r);
__Pyx_RefNannyFinishContext();
return __pyx_r;
}
/* … */
__pyx_tuple__22 = PyTuple_Pack(10, __pyx_n_s_signal, __pyx_n_s_alphabet, __pyx_n_s_out, __pyx_n_s_i, __pyx_n_s_j, __pyx_n_s_idx, __pyx_n_s_N, __pyx_n_s_M, __pyx_n_s_d, __pyx_n_s_dmin); if (unlikely(!__pyx_tuple__22)) __PYX_ERR(0, 10, __pyx_L1_error)
__Pyx_GOTREF(__pyx_tuple__22);
__Pyx_GIVEREF(__pyx_tuple__22);
/* … */
__pyx_t_7 = __Pyx_CyFunction_New(&__pyx_mdef_54_cython_magic_31fd376cc9d559b73513868b53e1b2ba887cac13_1make_decision_cy, 0, __pyx_n_s_make_decision_cy, NULL, __pyx_n_s_cython_magic_31fd376cc9d559b735, __pyx_d, ((PyObject *)__pyx_codeobj__23)); if (unlikely(!__pyx_t_7)) __PYX_ERR(0, 10, __pyx_L1_error)
__Pyx_GOTREF(__pyx_t_7);
if (PyDict_SetItem(__pyx_d, __pyx_n_s_make_decision_cy, __pyx_t_7) < 0) __PYX_ERR(0, 10, __pyx_L1_error)
__Pyx_DECREF(__pyx_t_7); __pyx_t_7 = 0;
11: @cython.wraparound(False) # array[-1] won't work
12: @cython.nonecheck(False) # variables are never set to None
13: @cython.cdivision(True) # don't protect against dividing by zero
14: def make_decision_cy(double complex[:] signal, double complex[:] alphabet):
15: cdef double complex[:] out
16: cdef int i, j, idx, N, M
17: cdef double d, dmin
+18: N = signal.size
__pyx_t_1 = __pyx_memoryview_fromslice(__pyx_v_signal, 1, (PyObject *(*)(char *)) __pyx_memview_get___pyx_t_double_complex, (int (*)(char *, PyObject *)) __pyx_memview_set___pyx_t_double_complex, 0);; if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 18, __pyx_L1_error) __Pyx_GOTREF(__pyx_t_1); __pyx_t_2 = __Pyx_PyObject_GetAttrStr(__pyx_t_1, __pyx_n_s_size); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 18, __pyx_L1_error) __Pyx_GOTREF(__pyx_t_2); __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0; __pyx_t_3 = __Pyx_PyInt_As_int(__pyx_t_2); if (unlikely((__pyx_t_3 == (int)-1) && PyErr_Occurred())) __PYX_ERR(0, 18, __pyx_L1_error) __Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0; __pyx_v_N = __pyx_t_3;
+19: M = alphabet.size
__pyx_t_2 = __pyx_memoryview_fromslice(__pyx_v_alphabet, 1, (PyObject *(*)(char *)) __pyx_memview_get___pyx_t_double_complex, (int (*)(char *, PyObject *)) __pyx_memview_set___pyx_t_double_complex, 0);; if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 19, __pyx_L1_error) __Pyx_GOTREF(__pyx_t_2); __pyx_t_1 = __Pyx_PyObject_GetAttrStr(__pyx_t_2, __pyx_n_s_size); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 19, __pyx_L1_error) __Pyx_GOTREF(__pyx_t_1); __Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0; __pyx_t_3 = __Pyx_PyInt_As_int(__pyx_t_1); if (unlikely((__pyx_t_3 == (int)-1) && PyErr_Occurred())) __PYX_ERR(0, 19, __pyx_L1_error) __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0; __pyx_v_M = __pyx_t_3;
+20: out = np.zeros_like(signal)
__Pyx_GetModuleGlobalName(__pyx_t_2, __pyx_n_s_np); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 20, __pyx_L1_error) __Pyx_GOTREF(__pyx_t_2); __pyx_t_4 = __Pyx_PyObject_GetAttrStr(__pyx_t_2, __pyx_n_s_zeros_like); if (unlikely(!__pyx_t_4)) __PYX_ERR(0, 20, __pyx_L1_error) __Pyx_GOTREF(__pyx_t_4); __Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0; __pyx_t_2 = __pyx_memoryview_fromslice(__pyx_v_signal, 1, (PyObject *(*)(char *)) __pyx_memview_get___pyx_t_double_complex, (int (*)(char *, PyObject *)) __pyx_memview_set___pyx_t_double_complex, 0);; if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 20, __pyx_L1_error) __Pyx_GOTREF(__pyx_t_2); __pyx_t_5 = NULL; __pyx_t_3 = 0; #if CYTHON_UNPACK_METHODS if (unlikely(PyMethod_Check(__pyx_t_4))) { __pyx_t_5 = PyMethod_GET_SELF(__pyx_t_4); if (likely(__pyx_t_5)) { PyObject* function = PyMethod_GET_FUNCTION(__pyx_t_4); __Pyx_INCREF(__pyx_t_5); __Pyx_INCREF(function); __Pyx_DECREF_SET(__pyx_t_4, function); __pyx_t_3 = 1; } } #endif { PyObject *__pyx_callargs[2] = {__pyx_t_5, __pyx_t_2}; __pyx_t_1 = __Pyx_PyObject_FastCall(__pyx_t_4, __pyx_callargs+1-__pyx_t_3, 1+__pyx_t_3); __Pyx_XDECREF(__pyx_t_5); __pyx_t_5 = 0; __Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0; if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 20, __pyx_L1_error) __Pyx_GOTREF(__pyx_t_1); __Pyx_DECREF(__pyx_t_4); __pyx_t_4 = 0; } __pyx_t_6 = __Pyx_PyObject_to_MemoryviewSlice_ds___pyx_t_double_complex(__pyx_t_1, PyBUF_WRITABLE); if (unlikely(!__pyx_t_6.memview)) __PYX_ERR(0, 20, __pyx_L1_error) __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0; __pyx_v_out = __pyx_t_6; __pyx_t_6.memview = NULL; __pyx_t_6.data = NULL;
+21: for i in prange(N, nogil=True):
{
#ifdef WITH_THREAD
PyThreadState *_save;
_save = NULL;
Py_UNBLOCK_THREADS
__Pyx_FastGIL_Remember();
#endif
/*try:*/ {
__pyx_t_3 = __pyx_v_N;
{
#if ((defined(__APPLE__) || defined(__OSX__)) && (defined(__GNUC__) && (__GNUC__ > 2 || (__GNUC__ == 2 && (__GNUC_MINOR__ > 95)))))
#undef likely
#undef unlikely
#define likely(x) (x)
#define unlikely(x) (x)
#endif
__pyx_t_8 = (__pyx_t_3 - 0 + 1 - 1/abs(1)) / 1;
if (__pyx_t_8 > 0)
{
#ifdef _OPENMP
#pragma omp parallel
#endif /* _OPENMP */
{
#ifdef _OPENMP
#pragma omp for lastprivate(__pyx_v_d) lastprivate(__pyx_v_dmin) firstprivate(__pyx_v_i) lastprivate(__pyx_v_i) lastprivate(__pyx_v_idx) lastprivate(__pyx_v_j)
#endif /* _OPENMP */
for (__pyx_t_7 = 0; __pyx_t_7 < __pyx_t_8; __pyx_t_7++){
{
__pyx_v_i = (int)(0 + 1 * __pyx_t_7);
/* Initialize private variables to invalid values */
__pyx_v_d = ((double)__PYX_NAN());
__pyx_v_dmin = ((double)__PYX_NAN());
__pyx_v_idx = ((int)0xbad0bad0);
__pyx_v_j = ((int)0xbad0bad0);
/* … */
/*finally:*/ {
/*normal exit:*/{
#ifdef WITH_THREAD
__Pyx_FastGIL_Forget();
Py_BLOCK_THREADS
#endif
goto __pyx_L5;
}
__pyx_L5:;
}
}
+22: dmin = 1000.
__pyx_v_dmin = 1000.;
+23: idx = 0
__pyx_v_idx = 0;
+24: for j in range(M):
__pyx_t_9 = __pyx_v_M;
__pyx_t_10 = __pyx_t_9;
for (__pyx_t_11 = 0; __pyx_t_11 < __pyx_t_10; __pyx_t_11+=1) {
__pyx_v_j = __pyx_t_11;
+25: d = cabs(signal[i]-alphabet[j])
__pyx_t_12 = __pyx_v_i;
__pyx_t_13 = __pyx_v_j;
__pyx_v_d = cabs(__Pyx_c_diff_double((*((__pyx_t_double_complex *) ( /* dim=0 */ (__pyx_v_signal.data + __pyx_t_12 * __pyx_v_signal.strides[0]) ))), (*((__pyx_t_double_complex *) ( /* dim=0 */ (__pyx_v_alphabet.data + __pyx_t_13 * __pyx_v_alphabet.strides[0]) )))));
+26: if d < dmin:
__pyx_t_14 = (__pyx_v_d < __pyx_v_dmin);
if (__pyx_t_14) {
/* … */
}
}
+27: idx = j
__pyx_v_idx = __pyx_v_j;
+28: dmin = d
__pyx_v_dmin = __pyx_v_d;
+29: out[i] = idx
__pyx_t_13 = __pyx_v_i;
*((__pyx_t_double_complex *) ( /* dim=0 */ (__pyx_v_out.data + __pyx_t_13 * __pyx_v_out.strides[0]) )) = __pyx_t_double_complex_from_parts(__pyx_v_idx, 0);
}
}
}
}
}
#if ((defined(__APPLE__) || defined(__OSX__)) && (defined(__GNUC__) && (__GNUC__ > 2 || (__GNUC__ == 2 && (__GNUC_MINOR__ > 95)))))
#undef likely
#undef unlikely
#define likely(x) __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
#endif
}
+30: return np.array(out)
__Pyx_XDECREF(__pyx_r); __Pyx_GetModuleGlobalName(__pyx_t_4, __pyx_n_s_np); if (unlikely(!__pyx_t_4)) __PYX_ERR(0, 30, __pyx_L1_error) __Pyx_GOTREF(__pyx_t_4); __pyx_t_2 = __Pyx_PyObject_GetAttrStr(__pyx_t_4, __pyx_n_s_array); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 30, __pyx_L1_error) __Pyx_GOTREF(__pyx_t_2); __Pyx_DECREF(__pyx_t_4); __pyx_t_4 = 0; __pyx_t_4 = __pyx_memoryview_fromslice(__pyx_v_out, 1, (PyObject *(*)(char *)) __pyx_memview_get___pyx_t_double_complex, (int (*)(char *, PyObject *)) __pyx_memview_set___pyx_t_double_complex, 0);; if (unlikely(!__pyx_t_4)) __PYX_ERR(0, 30, __pyx_L1_error) __Pyx_GOTREF(__pyx_t_4); __pyx_t_5 = NULL; __pyx_t_8 = 0; #if CYTHON_UNPACK_METHODS if (unlikely(PyMethod_Check(__pyx_t_2))) { __pyx_t_5 = PyMethod_GET_SELF(__pyx_t_2); if (likely(__pyx_t_5)) { PyObject* function = PyMethod_GET_FUNCTION(__pyx_t_2); __Pyx_INCREF(__pyx_t_5); __Pyx_INCREF(function); __Pyx_DECREF_SET(__pyx_t_2, function); __pyx_t_8 = 1; } } #endif { PyObject *__pyx_callargs[2] = {__pyx_t_5, __pyx_t_4}; __pyx_t_1 = __Pyx_PyObject_FastCall(__pyx_t_2, __pyx_callargs+1-__pyx_t_8, 1+__pyx_t_8); __Pyx_XDECREF(__pyx_t_5); __pyx_t_5 = 0; __Pyx_DECREF(__pyx_t_4); __pyx_t_4 = 0; if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 30, __pyx_L1_error) __Pyx_GOTREF(__pyx_t_1); __Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0; } __pyx_r = __pyx_t_1; __pyx_t_1 = 0; goto __pyx_L0;
Pythran#
We could get a speed improvement of a factor of almost 100 over the numpy code using cython. But the code became significantly more complex. Let us look at Pythran now to see what sort of improvements we get with it.
import pythran
%load_ext pythran.magic
Starting point#
We again start from the numpy function. Pythran works a bit differently to Cython. Instead of adding a type annotations to the code we simply use a comment to indicate the types for the parameters.
%%pythran
import numpy as np
#pythran export make_decision_pt(complex128[], complex128[])
def make_decision_pt(signal, alphabet):
out = np.zeros_like(signal)
d = np.abs(signal[np.newaxis,:]-alphabet[:, np.newaxis])
idx = np.argmin(d, axis=0)
out = alphabet[idx]
return out
Pythran also allows to add multiple annotations for passing different types
%%pythran
import numpy as np
#pythran export make_decision_pt(complex128[], complex128[])
#pythran export make_decision_pt(complex64[], complex64[])
def make_decision_pt(signal, alphabet):
out = np.zeros_like(signal)
d = np.abs(signal[np.newaxis,:]-alphabet[:, np.newaxis])
idx = np.argmin(d, axis=0)
out = alphabet[idx]
return out
%timeit make_decision_pt(qam64[0], qam64.coded_symbols)
2.41 ms ± 17.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
make_decision_pt(qam64[0].astype(np.complex64), qam64.coded_symbols.astype(np.complex64))
array([-0.15430336+0.77151674j, 0.46291006+1.0801234j ,
0.77151674+0.77151674j, ..., -0.77151674+0.15430336j,
-0.77151674-0.46291006j, -0.46291006-1.0801234j ], dtype=complex64)
Again not much gain over numpy.
However We did not use any compiler optimizations
Try again but use compiler optimizations
%%pythran -O3 -march=native -ffast-math -mfpmath=sse -fopenmp
import numpy as np
#pythran export make_decision_pt(complex128[], complex128[])
def make_decision_pt(signal, alphabet):
out = np.zeros_like(signal)
d = np.abs(signal[np.newaxis,:]-alphabet[:, np.newaxis])
idx = np.argmin(d, axis=0)
out = alphabet[idx]
return out
%timeit make_decision_pt(qam64[0], qam64.coded_symbols)
2.3 ms ± 3.42 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Thats another factor of 10 over the numpy code. Pythran can even use SIMD instructions. Lets see how much difference it makes here
%%pythran -O3 -march=native -ffast-math -mfpmath=sse -fopenmp -DUSE_XSIMD
import numpy as np
#pythran export make_decision_pt(complex128[], complex128[])
def make_decision_pt(signal, alphabet):
out = np.zeros_like(signal)
d = np.abs(signal[np.newaxis,:]-alphabet[:, np.newaxis])
idx = np.argmin(d, axis=0)
out = alphabet[idx]
return out
%timeit make_decision_pt(qam64[0], qam64.coded_symbols)
2.28 ms ± 2.05 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Let’s see if we can get some more gains by splitting up the loop like in cython. We can also use openmp (multi-threading) support.
%%pythran -O3 -march=native -ffast-math -mfpmath=sse -fopenmp
import numpy as np
#pythran export make_decision_pt2(complex128[], complex128[])
def make_decision_pt2(signal, alphabet):
out = np.zeros_like(signal)
#omp parallel for
for i in range(signal.size): # we only do 1D here
dmin = 1000
idx = 0
for j in range(alphabet.size):
d = abs(signal[i]-alphabet[j])
if d < dmin:
idx = j
dmin = d
out[i] = alphabet[idx]
return out
%timeit make_decision_pt2(qam64[0], qam64.coded_symbols)
298 µs ± 21.1 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Conclusions#
Both cython and pythran can yield significant speed improvements over python or even numpy code. The advantage of cython is that it is very generic and can speed up almost any code. Pythran on the other hand is not as general, but geared numerical applications, for this sort of code it can often yield impressive speed gains with very little adjustment to the original code (often only a comment is required).
Other packages#
Another very popular package for speeding up python code is Numba, which is a just-in-time (JIT) compiler. Numba can also get high speed gains, often very